Development of the RIKEN database for dynamic facial expressions with multiple angles

The development of facial expressions with sensing information is progressing in multidisciplinary fields, such as psychology, affective computing, and cognitive science. Previous facial datasets have not simultaneously dealt with multiple theoretical views of emotion, individualized context, or multi-angle/depth information. We developed a new facial database (RIKEN facial expression database) that includes multiple theoretical views of emotions and expressers’ individualized events with multi-angle and depth information. The RIKEN facial expression database contains recordings of 48 Japanese participants captured using ten Kinect cameras at 25 events. This study identified several valence-related facial patterns and found them consistent with previous research investigating the coherence between facial movements and internal states. This database represents an advancement in developing a new sensing system, conducting psychological experiments, and understanding the complexity of emotional events.

www.nature.com/scientificreports/
week before the recording session to obtain individualized emotional events (Fig. 1). Qualtrics was used as the platform for collecting the events. Participants were instructed to describe a single event corresponding to each cell in Fig. 1. The order of events (i.e., each cell in Fig. 1) for the valence and arousal combinations was randomized. Participants also rated appraisal checks from 1 (strongly disagree) to 5 (strongly agree) for novelty (predictable: "the event was predictable"; familiar: "the event was common"), goal significance ("the event was important to you"), and coping potential ("the event could have been controlled and avoided if you had taken appropriate actions") for each event described. These appraisal checks were derived from a previously reported facial database that relied on the CPM as its theoretical basis 28. The participants were also asked to freely describe possible labels for each event.
On the day of the facial clip recording, the participants were given a further explanation of the experiment. They were then taken to the recording location (the first basement floor of the Advanced Telecommunications Research Institute International). Figure 2 displays the recording environment. The participants sat in chairs with their faces fixed in a steady position. We set up three photographic lights (AL-LED-SQA-W: Toshiba) and illuminated each participant's face from the upper right, upper left, and lower sides to make the face clearly visible and remove shadows.
We then asked the participants to remove their masks and glasses. We recorded the participants' facial movements as video clips using ten Azure Kinect DK1880 cameras (Microsoft; 2D: 1920 × 1080 pixels; depth: 640 × 576 pixels) at 30 frames per second. The interval between the left and right horizontal cameras was 22.5°, and the cameras were placed at 22.5°, 45°, and 90° (skipping 67.5° in this database). To avoid interference between the multiple depth cameras, the capture timing of each camera was shifted by 160 microseconds. The recording program was written using a Software Development Kit. Depth information was limited to eight cameras to reduce the processing load and avoid equipment errors: one upper and two lower cameras, and one front-facing, two left, and two right cameras. A green carpet covered as much of the background as possible (Fig. 2).
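The staggered depth capture can be illustrated as a simple offset schedule. This is a minimal sketch, assuming that the 160-microsecond shift accumulates from one depth-enabled camera to the next in a fixed order; the camera labels and the function `depth_delay_schedule` are hypothetical and are not taken from the recording program.

```python
# Hypothetical labels for the eight depth-enabled cameras (not from the source).
DEPTH_CAMERAS = [
    "front", "left_1", "left_2", "right_1", "right_2",
    "upper", "lower_1", "lower_2",
]

SHIFT_US = 160  # per-camera timing shift reported in the text (microseconds)
FPS = 30        # recording frame rate reported in the text


def depth_delay_schedule(cameras, shift_us=SHIFT_US):
    """Return {camera: delay in microseconds}, assuming cumulative shifts."""
    return {name: index * shift_us for index, name in enumerate(cameras)}


schedule = depth_delay_schedule(DEPTH_CAMERAS)
frame_period_us = 1_000_000 / FPS

# Every delay must stay far below one frame period so that all cameras
# still capture the same nominal frame.
assert max(schedule.values()) < frame_period_us

for name, delay in schedule.items():
    print(f"{name}: {delay} us")
```

Under this assumption, the largest delay is a small fraction of the 30 fps frame period, so staggering the depth pulses does not desynchronize the clips at the frame level.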
The experimenter verbally narrated the individual event descriptions, collected one week before the recording session, for each expression. Participants were instructed to vividly re-experience their emotions and to practice expressing them facially using a hand mirror before the recording. Participants were allowed to recall the events and practice their facial expressions with no time restrictions. When participants felt ready, they rang a bell to initiate the recording, and the experimenter verbally narrated the event again. The recording process was structured into distinct segments. The timing of each segment was indicated by beep sounds (onset: 880 Hz; peak: 1174 Hz; offset: 880 Hz) played through the speaker system to guide the participants in producing their expressions according to the time course. The participants were instructed to produce an emotional expression rooted in the pre-described event during the initial 1 s, maintain the intended expression for 2 s, and then return to a neutral expression over 1 s. The order of events was also randomized.
view have the highest accuracy in OpenFace, only facial expressions from the front-view camera were targeted in this study. Given the nature of the procedure, in which the intensity of facial action combinations is expected to be maximal at the apex beep sound, we mainly focused on the middle frame (i.e., frame 61).
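The relationship between the beep-guided time course and frame indices can be sketched as follows. This is an illustration only, assuming frames are numbered from 1 and that each clip nominally contains 4 s × 30 frames per second = 120 frames; the function names are hypothetical.

```python
FPS = 30  # frames per second, as reported for the recordings

# Segment durations in seconds, following the instructed time course.
SEGMENTS = [("onset", 1.0), ("peak", 2.0), ("offset", 1.0)]


def segment_frame_ranges(segments=SEGMENTS, fps=FPS):
    """Return {segment: (first_frame, last_frame)} with 1-indexed frames."""
    ranges = {}
    start = 1
    for name, seconds in segments:
        n_frames = int(round(seconds * fps))
        ranges[name] = (start, start + n_frames - 1)
        start += n_frames
    return ranges


def middle_frame(first, last):
    """Middle of a 1-indexed inclusive range, rounding up for even lengths."""
    return first + (last - first + 1) // 2


ranges = segment_frame_ranges()
peak_middle = middle_frame(*ranges["peak"])

print(ranges)
print("middle frame of the peak segment:", peak_middle)
```

With these assumptions, the peak segment spans frames 31-90 and the offset segment spans frames 91-120, and the middle of the peak segment coincides with frame 61.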
We used R 65 for statistical analysis. We used the tm and openxlsx packages to perform text mining for each event [66][67][68]. The psych package 69 was used to check the correlations between the appraisal dimensions. We used the nnTensor package to reduce the dimensionality of the data extracted by OpenFace 70. We used the tidyverse package for data visualization 71. Based on ample psychophysiological evidence, we predicted and analyzed the relationships between valence and AUs 4/12 with hierarchical linear regression modeling using the lmerTest package 72. The results were considered significant at p < 0.05. To elucidate the relationship between arousal and AUs, we used Bayesian Lasso regression 73, treating arousal as the dependent variable and all AUs as independent variables, with the tuning parameter set at a degree of freedom of 1 74. The AU data were standardized, and only results whose 95% credible interval did not include zero were reported. All code is available on the Gakunin RDM (https://dmsgrdm.riken.jp:5000/uphvb/). The design and analysis of this study were not pre-registered.

Results
The detail of the events
As indicated in the "Methods" section, we obtained 1,200 events (48 participants × 5 valences × 5 arousals). All Japanese events were translated and back-translated into English using TEXT (https://www.text-edit.com/english-page/). Table 1 lists the three most frequently used English words for each event, obtained by text mining. In the obtained database, words indicating common events (frequency of 10/48 or more) occurred in valence 4 × arousal 4 events (friend) and valence 5 × arousal 5 events (passing the university entrance exam). The latter, in particular, suggests that university entrance exams featured prominently among emotional events because this research was limited to young participants.
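The word-frequency step can be sketched in a few lines. This is a minimal illustration rather than the tm-based pipeline used in the study: the event descriptions and the stopword list below are invented, and counting each word at most once per description is an assumption made so that frequencies can be read as "number of participants out of 48".

```python
from collections import Counter
import re

N_PARTICIPANTS, N_VALENCE, N_AROUSAL = 48, 5, 5
assert N_PARTICIPANTS * N_VALENCE * N_AROUSAL == 1200  # events in the database

# Toy descriptions, invented for illustration only (not from the database).
events = {
    ("V5", "A5"): ["I passed the university entrance exam", "Passed my exam at last"],
    ("V4", "A4"): ["Had dinner with a friend", "A friend visited me"],
}

# Hypothetical stopword list; 'get/got' is excluded as described for Table 1.
STOPWORDS = {"i", "the", "a", "my", "at", "with", "me", "had", "get", "got"}


def top_words(texts, n=3):
    """Count how many descriptions contain each word and return the n most common."""
    counts = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        counts.update(words)
    return counts.most_common(n)


for cell, texts in events.items():
    print(cell, top_words(texts))
```

Ties among equally frequent words are returned in arbitrary order here, which mirrors the "…" notation used in Table 1 for words of equivalent frequency.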
The correspondence between the valence, arousal, and appraisal ratings is also presented in Table 2. Valence and arousal were positively associated with the appraised importance of each event (rs > 0.20). The results also revealed that higher predictability was associated with higher valence (r = 0.22). Additionally, the more unfamiliar the event, the higher the arousal (r = −0.19). Among the appraisal dimensions, positive correlations were found between predictability and familiarity, and between predictability and controllability (rs > 0.29).
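For readers who wish to recompute such correspondences from the rating data, a Pearson correlation can be written directly. The sketch below uses invented ratings and a hypothetical helper `pearson_r`; the study itself used the psych package in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Toy 1-5 ratings, invented for illustration only (not from the database).
importance = [1, 2, 3, 4, 5]
valence = [2, 1, 4, 3, 5]

r = pearson_r(importance, valence)
print(f"r = {r:.2f}")
```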
As the participants were asked to freely describe the possible labels for each event, each event had an emotional term that the participant subjectively labelled. To provide information labelled as an individual event, Table 3 lists the most frequently used labels of emotions using the free description data. Only the top 18 modes (N = 730/1200) are listed. Positive emotional labels, such as joy, happiness, and fun, indicated high valence; negative emotional labels, such as anger, sadness, and unpleasantness, indicated low valence. Arousal was high with surprise and impatience. Although there were other interesting correspondences between the controllability component and frustration, predictability, and fun, these were not examined as they went beyond the study's purpose of overviewing the events in our database.

The detail of facial movements
The facial data for some events are missing due to camera malfunctions and participant problems, although the events themselves were recorded as stated above. The available number of frames was 142,865. When only the peak frame was extracted, there were 1,190 frames. Six events for male and four for female participants were missing expressions. Ultimately, 1,190 data points were available for analysis. Figure 3 shows the facial patterns of all the individual events associated with valence and arousal. Visual inspection revealed that AU4 (brow lowerer) and AU7 (lid tightener) were strongly expressed during negative events (V1-V2). Positive events (V4-V5) induced AU6 (cheek raiser), AU7, AU10 (upper lip raiser), AU12 (lip corner puller), and AU14 (dimpler), which can be considered strong smiling expressions. The intensity of facial movements may be relatively low in neutral events (V3) compared with the negative and positive events. Moreover, for positive events, Fig. 3 indicates that higher arousal was associated with more mouth-opening movements (AU25: lips part; AU26: jaw drop). We also checked the correlations between the estimated AUs and the appraisal dimensions in the peak-intensity frame (Table 4). Compared with the correlations between valence and some facial movements, such as AU12 (lip corner puller: r = 0.49), the correlations between facial movements and the other appraisal dimensions were relatively low (|r|s < 0.25).
A hierarchical linear regression model examined the relationship between valence/arousal and the AUs. Consistent with our predictions, the results indicated that valence significantly and negatively predicted the intensity of AU4 (brow lowerer; β = −0.11, t = 6.38, p < 0.001) and positively predicted that of AU12 (lip corner puller; β = 0.28, t = 12.64, p < 0.001). In addition, arousal significantly predicted the intensity of AU12 (β = 0.10, t = 9.19, p < 0.001). Furthermore, a post-hoc sensitivity power analysis using the simr package 75 indicated that the current sample size (i.e., N = 1190) was sufficient to detect all coefficients in the hierarchical linear regression models with a significance level of α = 0.05 and 99% power.
To explore further relationships between arousal and the AUs, we also used Bayesian Lasso regression. AU12 (lip corner puller) and AU25 (opening the mouth) were found to predict arousal (β = 0.13, 95% credible interval [0.01, 0.26], and β = 0.08, 95% credible interval [0.00, 0.17], respectively). None of the other predictors reliably predicted arousal, as their 95% credible intervals included zero.
We confirmed the dynamics of the facial expressions obtained in this database by applying non-negative matrix factorization to reduce dimensionality and extract spatiotemporal features 76. This approach can identify dynamic facial patterns [77][78][79]. The factorization rank was determined using cophenetic coefficients 80 and the dispersion index 81. Information on factorization rank is available on the Gakunin RDM (https://dmsgrdm.riken.jp:5000/uphvb/).
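The idea behind non-negative matrix factorization can be sketched with the classic multiplicative update rules for the squared reconstruction error. This is a generic illustration swapped in for clarity, not the nnTensor routine used in the study; the toy frames × AUs matrix, the rank, and the function names are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (m x n) into non-negative W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps) for j in range(n)]
             for r in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + eps) for r in range(rank)]
             for i in range(m)]
    return w, h


# Toy matrix: rows are frames, columns are AU4, AU6, AU12 intensities.
# Values are invented for illustration only (not from the database).
V = [
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 2.0],
    [0.0, 2.0, 4.0],
    [0.0, 1.0, 2.0],
]

W, H = nmf(V, rank=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

In this layout, each row of `H` is a spatial component (an AU profile) and each column of `W` describes how strongly that component is expressed across frames.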
Figure 4 displays the AU profiles of the top four components. By visually inspecting the relative contribution of each AU to each component, we interpreted Component 1 as a Duchenne marker (AU6, 7), Component 2 as blinking and other facial movements (AU1, 14, 17, 45), Component 3 as brow lowering (AU4), and Component 4 as smiling (AU6, 10, 12, 14). These results were also consistent with the peak intensities of each facial movement (Fig. 3).
Figure 5 shows how the spatial components changed over time for each valence and arousal combination. Visual inspection of Component 1 (Duchenne marker) revealed that negative (V1-V2) and positive (V4-V5) events showed larger movements (e.g., V1A1 and V5A5). This result is consistent with the finding that eye constriction is systematically associated with the facial expressions of negative and positive emotions 82. Component 2 (blinking and other facial movements) can be interpreted either as a relaxation of the tension associated with deliberately producing a facial expression or as noise unrelated to the main emotional expression, because this movement increases during the offset duration (frames = 91-120) after the peak duration (frames = 31-90). For Component 3 (brow lowering), negative expressions (V1-V2) produced more intense facial changes than other expressions (V3-V5). Component 4 (smiling) occurred more frequently during positive events (V4-V5) than others (V1-V3).
In summary, blinking and other facial movements, such as raising the inner eyebrow and chin (i.e., Component 2), were peculiar to the offset of deliberate facial expressions in naive Japanese participants. More interestingly, the results clarified that smiling (Component 4) is related to positive valence, lowering of the eyebrows (Component 3) is related to negative valence, and eye constriction (Component 1) corresponds to both valences.
As a supplementary analysis and an example of the potential uses of the database, it may be useful to visualize dynamic changes, rather than correlations in the peak frame (Table 4), in the relationship between one appraisal dimension and one facial pattern. According to Scherer's theory, the appraisals (and the corresponding AUs) appear sequentially. Figure 6 shows the relationship between one appraisal dimension (importance) and one component (AU6, 10, 12, 14). This indicates that as the appraised importance of an event increases, more smiling is seen in response to the event.

Discussion
This study developed a new facial database with expresser annotations such as individualized emotional events, appraisal checks, and free description labels with multi-angle and depth information. The results (Table 3) indicate that the words for each event had few matches, implying that the database has a large variance in emotional events. A database with various events and individual evaluations can be examined for academic purposes. For example, researchers can investigate issues such as the typical elements of events labeled as anger and the appraisal components that constitute them, either in a data-driven manner or as a starting point.
According to the analysis of front-view facial expressions, facial movements related to pleasant and unpleasant valences were observed. For example, lowering the brow was related to a negative valence, whereas pulling the lip corner was related to a positive valence. These results are consistent with previous findings investigating the coherence between valence and facial muscle electrical activity. Moreover, the Bayesian Lasso analysis indicated that mouth movements such as AU12 (pulling the lip corner) and AU25 (opening the mouth) were also associated with arousal. Opening the mouth has been shown to increase arousal attribution from observers 40, which corresponds to ratings on the part of the expressers. This finding contributes to understanding the relationship between specific facial actions and arousal. In addition to the data provided here, we are currently performing manual facial action coding by certified FACS coders. We will open the annotation data in the future (at present, data for 8 people are already annotated and available in the same database; there are 29 types of manually annotated facial actions). In recent years, amidst the controversy about emotion 83, there have been increasing efforts to extract facial movements 84. The opening of databases, including manual FACS annotations that include in-depth information, can prime how research in affective computing can be further developed.
While this study provides a new facial database on emotions, certain limitations exist. First, the number of participants was small, given the diversity of facial movements and emotional events. In particular, the database only includes recordings of Japanese participants, which may limit its generalizability to other populations. Future research using similar environments, as represented in Fig. 2, will create additional databases for young and older adult participants and extend to other cultures or ethnicities beyond the Japanese population. Second, this study dealt only with facial responses as emotional expressions. However, other aspects such as vocal or physiological responses would be important for understanding emotional communication [85][86][87]. Expanding to those modalities could provide a useful database for understanding emotion further. Finally, we did not investigate how depth or infrared information can be used, although, unlike 2D color images, this information is not influenced by lighting conditions. This database will be an important foundation for developing a robust sensing system for facial movements in room conditions. Using these databases, we provide an internal state estimation algorithm via an Application Programming Interface combined with smartphones and other devices. Furthermore, we would like to utilize this technology to develop solutions for people with difficulty communicating.
The database, including the expressers' events, labels, and appraisal checking intensity, is available as a RIKEN facial expression database for academic purposes. The notable features of this database are as follows: (a) availability of multiple theoretical views for emotion (valence and arousal, appraisal dimensions, and free emotion label), (b) variety of events, and (c) rich information taken from 10 multi-angle and depth cameras.

Figure 2. (A) The setup of the apparatus. The camera arrangement was illustrated using Autodesk Fusion 360. Three lights were shone on the face from under the feet and from above on the left and right sides, and facial expressions were captured with one upper and two lower cameras, and front-facing, two left, and two right cameras. (B) Sample images. A green carpet covered as much of the background as possible.

Figure 4. Heatmap of each component's loadings for facial expressions of all events.Value colors represent each facial movement's contribution to component scores.

Figure 5. Temporal changes in the four components for the facial expressions of 25 events.

Figure 6. Temporal changes in the smiling pattern for the appraisal check of importance.

Table 1. Frequent words in each individualized event. "…" means that there is more than one word with a frequency equivalent to the words listed in the table. V represents valence, and A represents arousal. The word 'get/got' was removed as it is used as a 'be' verb in Japanese.

Table 3. Eighteen emotional labels and corresponding valence, arousal, and appraisal checks.

Table 4. Correspondence between action units and valence and arousal values and appraisal ratings.